🚢 Titanic Survival Prediction 🚢


📚 Importing Libraries 📚

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

⏳ Loading the dataset ⏳

In [2]:
titanic = pd.read_csv('titanic.csv')

🧠 Understanding of data 🧠

In [3]:
titanic.head()
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 0 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 1 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 0 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 0 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 1 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
In [4]:
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB

🧹 Data Cleaning 🧹

In [5]:
# Checking null values

titanic.isna().sum()
Out[5]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
In [6]:
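# Handling the null values

columns = ['Age', 'Fare']
for col in columns:
    # Assign the result back; fillna(inplace=True) on a column selection is deprecated in recent pandas
    titanic[col] = titanic[col].fillna(titanic[col].median())

titanic['Cabin'] = titanic['Cabin'].fillna('Unknown')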
In [7]:
# Checking for duplicate rows

dup = titanic.duplicated().sum()
print("The number of duplicated rows in the dataset is:", dup)
The number of duplicated rows in the dataset is: 0
In [8]:
#Checking if there are any typos

for col in titanic.select_dtypes(include = "object"):
    print(f"Name of Column: {col}")
    print(titanic[col].unique())
    print('\n', '-'*60, '\n')
Name of Column: Name
['Kelly, Mr. James' 'Wilkes, Mrs. James (Ellen Needs)'
 'Myles, Mr. Thomas Francis' 'Wirz, Mr. Albert'
 'Hirvonen, Mrs. Alexander (Helga E Lindqvist)'
 'Svensson, Mr. Johan Cervin' 'Connolly, Miss. Kate'
 'Caldwell, Mr. Albert Francis'
 'Abrahim, Mrs. Joseph (Sophie Halaut Easu)' 'Davies, Mr. John Samuel'
 'Ilieff, Mr. Ylio' 'Jones, Mr. Charles Cresson'
 'Snyder, Mrs. John Pillsbury (Nelle Stevenson)' 'Howard, Mr. Benjamin'
 'Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood)'
 'del Carlo, Mrs. Sebastiano (Argenia Genovesi)' 'Keane, Mr. Daniel'
 'Assaf, Mr. Gerios' 'Ilmakangas, Miss. Ida Livija'
 'Assaf Khalil, Mrs. Mariana (Miriam")"' 'Rothschild, Mr. Martin'
 'Olsen, Master. Artur Karl' 'Flegenheim, Mrs. Alfred (Antoinette)'
 'Williams, Mr. Richard Norris II'
 'Ryerson, Mrs. Arthur Larned (Emily Maria Borie)'
 'Robins, Mr. Alexander A' 'Ostby, Miss. Helene Ragnhild'
 'Daher, Mr. Shedid' 'Brady, Mr. John Bertram' 'Samaan, Mr. Elias'
 'Louch, Mr. Charles Alexander' 'Jefferys, Mr. Clifford Thomas'
 'Dean, Mrs. Bertram (Eva Georgetta Light)'
 'Johnston, Mrs. Andrew G (Elizabeth Lily" Watson)"'
 'Mock, Mr. Philipp Edmund'
 'Katavelas, Mr. Vassilios (Catavelas Vassilios")"' 'Roth, Miss. Sarah A'
 'Cacic, Miss. Manda' 'Sap, Mr. Julius' 'Hee, Mr. Ling' 'Karun, Mr. Franz'
 'Franklin, Mr. Thomas Parham' 'Goldsmith, Mr. Nathan'
 'Corbett, Mrs. Walter H (Irene Colvin)'
 'Kimball, Mrs. Edwin Nelson Jr (Gertrude Parsons)'
 'Peltomaki, Mr. Nikolai Johannes' 'Chevre, Mr. Paul Romaine'
 'Shaughnessy, Mr. Patrick'
 'Bucknell, Mrs. William Robert (Emma Eliza Ward)'
 'Coutts, Mrs. William (Winnie Minnie" Treanor)"'
 'Smith, Mr. Lucien Philip' 'Pulbaum, Mr. Franz'
 'Hocking, Miss. Ellen Nellie""' 'Fortune, Miss. Ethel Flora'
 'Mangiavacchi, Mr. Serafino Emilio' 'Rice, Master. Albert'
 'Cor, Mr. Bartol' 'Abelseth, Mr. Olaus Jorgensen'
 'Davison, Mr. Thomas Henry' 'Chaudanson, Miss. Victorine'
 'Dika, Mr. Mirko' 'McCrae, Mr. Arthur Gordon'
 'Bjorklund, Mr. Ernst Herbert' 'Bradley, Miss. Bridget Delia'
 'Ryerson, Master. John Borie'
 'Corey, Mrs. Percy C (Mary Phyllis Elizabeth Miller)'
 'Burns, Miss. Mary Delia' 'Moore, Mr. Clarence Bloomfield'
 'Tucker, Mr. Gilbert Milligan Jr' 'Fortune, Mrs. Mark (Mary McDougald)'
 'Mulvihill, Miss. Bertha E' 'Minkoff, Mr. Lazar'
 'Nieminen, Miss. Manta Josefina' 'Ovies y Rodriguez, Mr. Servando'
 'Geiger, Miss. Amalie' 'Keeping, Mr. Edwin' 'Miles, Mr. Frank'
 'Cornell, Mrs. Robert Clifford (Malvina Helen Lamson)'
 'Aldworth, Mr. Charles Augustus' 'Doyle, Miss. Elizabeth'
 'Boulos, Master. Akar' 'Straus, Mr. Isidor' 'Case, Mr. Howard Brown'
 'Demetri, Mr. Marinko' 'Lamb, Mr. John Joseph' 'Khalil, Mr. Betros'
 'Barry, Miss. Julia' 'Badman, Miss. Emily Louisa'
 "O'Donoghue, Ms. Bridget" 'Wells, Master. Ralph Lester'
 'Dyker, Mrs. Adolf Fredrik (Anna Elisabeth Judith Andersson)'
 'Pedersen, Mr. Olaf' 'Davidson, Mrs. Thornton (Orian Hays)'
 'Guest, Mr. Robert' 'Birnbaum, Mr. Jakob' 'Tenglin, Mr. Gunnar Isidor'
 'Cavendish, Mrs. Tyrell William (Julia Florence Siegel)'
 'Makinen, Mr. Kalle Edvard' 'Braf, Miss. Elin Ester Maria'
 'Nancarrow, Mr. William Henry'
 'Stengel, Mrs. Charles Emil Henry (Annie May Morris)'
 'Weisz, Mr. Leopold' 'Foley, Mr. William'
 'Johansson Palmquist, Mr. Oskar Leander'
 'Thomas, Mrs. Alexander (Thamine Thelma")"' 'Holthen, Mr. Johan Martin'
 'Buckley, Mr. Daniel' 'Ryan, Mr. Edward'
 'Willer, Mr. Aaron (Abi Weller")"' 'Swane, Mr. George'
 'Stanton, Mr. Samuel Ward' 'Shine, Miss. Ellen Natalia'
 'Evans, Miss. Edith Corse' 'Buckley, Miss. Katherine'
 'Straus, Mrs. Isidor (Rosalie Ida Blun)' 'Chronopoulos, Mr. Demetrios'
 'Thomas, Mr. John' 'Sandstrom, Miss. Beatrice Irene'
 'Beattie, Mr. Thomson' 'Chapman, Mrs. John Henry (Sara Elizabeth Lawry)'
 'Watt, Miss. Bertha J' 'Kiernan, Mr. John'
 'Schabert, Mrs. Paul (Emma Mock)' 'Carver, Mr. Alfred John'
 'Kennedy, Mr. John' 'Cribb, Miss. Laura Alice' 'Brobeck, Mr. Karl Rudolf'
 'McCoy, Miss. Alicia' 'Bowenur, Mr. Solomon' 'Petersen, Mr. Marius'
 'Spinner, Mr. Henry John' 'Gracie, Col. Archibald IV'
 'Lefebre, Mrs. Frank (Frances)' 'Thomas, Mr. Charles P'
 'Dintcheff, Mr. Valtcho' 'Carlsson, Mr. Carl Robert'
 'Zakarian, Mr. Mapriededer' 'Schmidt, Mr. August' 'Drapkin, Miss. Jennie'
 'Goodwin, Mr. Charles Frederick' 'Goodwin, Miss. Jessie Allis'
 'Daniels, Miss. Sarah' 'Ryerson, Mr. Arthur Larned'
 'Beauchamp, Mr. Henry James'
 'Lindeberg-Lind, Mr. Erik Gustaf (Mr Edward Lingrey")"'
 'Vander Planke, Mr. Julius' 'Hilliard, Mr. Herbert Henry'
 'Davies, Mr. Evan' 'Crafton, Mr. John Bertram' 'Lahtinen, Rev. William'
 'Earnshaw, Mrs. Boulton (Olive Potter)' 'Matinoff, Mr. Nicola'
 'Storey, Mr. Thomas' 'Klasen, Mrs. (Hulda Kristina Eugenia Lofqvist)'
 'Asplund, Master. Filip Oscar' 'Duquemin, Mr. Joseph' 'Bird, Miss. Ellen'
 'Lundin, Miss. Olga Elida' 'Borebank, Mr. John James'
 'Peacock, Mrs. Benjamin (Edith Nile)' 'Smyth, Miss. Julia'
 'Touma, Master. Georges Youssef' 'Wright, Miss. Marion'
 'Pearce, Mr. Ernest' 'Peruschitz, Rev. Joseph Maria'
 'Kink-Heilmann, Mrs. Anton (Luise Heilmann)' 'Brandeis, Mr. Emil'
 'Ford, Mr. Edward Watson'
 'Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genevieve Fosdick)'
 'Hellstrom, Miss. Hilda Maria' 'Lithman, Mr. Simon' 'Zakarian, Mr. Ortin'
 'Dyker, Mr. Adolf Fredrik' 'Torfa, Mr. Assad'
 'Asplund, Mr. Carl Oscar Vilhelm Gustafsson' 'Brown, Miss. Edith Eileen'
 'Sincock, Miss. Maude' 'Stengel, Mr. Charles Emil Henry'
 'Becker, Mrs. Allen Oliver (Nellie E Baumgardner)'
 'Compton, Mrs. Alexander Taylor (Mary Eliza Ingersoll)'
 'McCrie, Mr. James Matthew' 'Compton, Mr. Alexander Taylor Jr'
 'Marvin, Mrs. Daniel Warner (Mary Graham Carmichael Farquarson)'
 'Lane, Mr. Patrick'
 'Douglas, Mrs. Frederick Charles (Mary Helene Baxter)'
 'Maybery, Mr. Frank Hubert' 'Phillips, Miss. Alice Frances Louisa'
 'Davies, Mr. Joseph' 'Sage, Miss. Ada' 'Veal, Mr. James'
 'Angle, Mr. William A' 'Salomon, Mr. Abraham L'
 'van Billiard, Master. Walter John' 'Lingane, Mr. John'
 'Drew, Master. Marshall Brines' 'Karlsson, Mr. Julius Konrad Eugen'
 'Spedden, Master. Robert Douglas' 'Nilsson, Miss. Berta Olivia'
 'Baimbrigge, Mr. Charles Robert'
 'Rasmussen, Mrs. (Lena Jacobsen Solvang)' 'Murphy, Miss. Nora'
 'Danbom, Master. Gilbert Sigvard Emanuel' 'Astor, Col. John Jacob'
 'Quick, Miss. Winifred Vera' 'Andrew, Mr. Frank Thomas'
 'Omont, Mr. Alfred Fernand' 'McGowan, Miss. Katherine'
 'Collett, Mr. Sidney C Stuart' 'Rosenbaum, Miss. Edith Louise'
 'Delalic, Mr. Redjo' 'Andersen, Mr. Albert Karvin' 'Finoli, Mr. Luigi'
 'Deacon, Mr. Percy William'
 'Howard, Mrs. Benjamin (Ellen Truelove Arman)'
 'Andersson, Miss. Ida Augusta Margareta' 'Head, Mr. Christopher'
 'Mahon, Miss. Bridget Delia' 'Wick, Mr. George Dennick'
 'Widener, Mrs. George Dunton (Eleanor Elkins)'
 'Thomson, Mr. Alexander Morrison' 'Duran y More, Miss. Florentina'
 'Reynolds, Mr. Harold J' 'Cook, Mrs. (Selena Rogers)'
 'Karlsson, Mr. Einar Gervasius'
 'Candee, Mrs. Edward (Helen Churchill Hungerford)'
 'Moubarek, Mrs. George (Omine Amenia" Alexander)"'
 'Asplund, Mr. Johan Charles' 'McNeill, Miss. Bridget'
 'Everett, Mr. Thomas James' 'Hocking, Mr. Samuel James Metcalfe'
 'Sweet, Mr. George Frederick' 'Willard, Miss. Constance'
 'Wiklund, Mr. Karl Johan' 'Linehan, Mr. Michael'
 'Cumings, Mr. John Bradley' 'Vendel, Mr. Olof Edvin'
 'Warren, Mr. Frank Manley' 'Baccos, Mr. Raffull' 'Hiltunen, Miss. Marta'
 'Douglas, Mrs. Walter Donald (Mahala Dutton)'
 'Lindstrom, Mrs. Carl Johan (Sigrid Posse)'
 'Christy, Mrs. (Alice Frances)' 'Spedden, Mr. Frederic Oakley'
 'Hyman, Mr. Abraham' 'Johnston, Master. William Arthur Willie""'
 'Kenyon, Mr. Frederick R' 'Karnes, Mrs. J Frank (Claire Bennett)'
 'Drew, Mr. James Vivian' 'Hold, Mrs. Stephen (Annie Margaret Hill)'
 'Khalil, Mrs. Betros (Zahie Maria" Elias)"' 'West, Miss. Barbara J'
 'Abrahamsson, Mr. Abraham August Johannes' 'Clark, Mr. Walter Miller'
 'Salander, Mr. Karl Johan' 'Wenzel, Mr. Linhart'
 'MacKay, Mr. George William' 'Mahon, Mr. John' 'Niklasson, Mr. Samuel'
 'Bentham, Miss. Lilian W' 'Midtsjo, Mr. Karl Albert'
 'de Messemaeker, Mr. Guillaume Joseph' 'Nilsson, Mr. August Ferdinand'
 'Wells, Mrs. Arthur Henry (Addie" Dart Trevaskis)"'
 'Klasen, Miss. Gertrud Emilia' 'Portaluppi, Mr. Emilio Ilario Giuseppe'
 'Lyntakoff, Mr. Stanko' 'Chisholm, Mr. Roderick Robert Crispin'
 'Warren, Mr. Charles William' 'Howard, Miss. May Elizabeth'
 'Pokrnic, Mr. Mate' 'McCaffry, Mr. Thomas Francis' 'Fox, Mr. Patrick'
 'Clark, Mrs. Walter Miller (Virginia McDowell)' 'Lennon, Miss. Mary'
 'Saade, Mr. Jean Nassr' 'Bryhl, Miss. Dagmar Jenny Ingeborg '
 'Parker, Mr. Clifford Richard' 'Faunthorpe, Mr. Harry'
 'Ware, Mr. John James' 'Oxenham, Mr. Percy Thomas'
 'Oreskovic, Miss. Jelka' 'Peacock, Master. Alfred Edward'
 'Fleming, Miss. Honora' 'Touma, Miss. Maria Youssef'
 'Rosblom, Miss. Salli Helena' 'Dennis, Mr. William'
 'Franklin, Mr. Charles (Charles Fardon)' 'Snyder, Mr. John Pillsbury'
 'Mardirosian, Mr. Sarkis' 'Ford, Mr. Arthur'
 'Rheims, Mr. George Alexander Lucien'
 'Daly, Miss. Margaret Marcella Maggie""' 'Nasr, Mr. Mustafa'
 'Dodge, Dr. Washington' 'Wittevrongel, Mr. Camille'
 'Angheloff, Mr. Minko' 'Laroche, Miss. Louise' 'Samaan, Mr. Hanna'
 'Loring, Mr. Joseph Holland' 'Johansson, Mr. Nils'
 'Olsson, Mr. Oscar Wilhelm' 'Malachard, Mr. Noel'
 'Phillips, Mr. Escott Robert' 'Pokrnic, Mr. Tome'
 'McCarthy, Miss. Catherine Katie""'
 'Crosby, Mrs. Edward Gifford (Catherine Elizabeth Halstead)'
 'Allison, Mr. Hudson Joshua Creighton' 'Aks, Master. Philip Frank'
 'Hays, Mr. Charles Melville' 'Hansen, Mrs. Claus Peter (Jennie L Howard)'
 'Cacic, Mr. Jego Grga' 'Vartanian, Mr. David' 'Sadowitz, Mr. Harry'
 'Carr, Miss. Jeannie' 'White, Mrs. John Stuart (Ella Holmes)'
 'Hagardon, Miss. Kate' 'Spencer, Mr. William Augustus'
 'Rogers, Mr. Reginald Harry' 'Jonsson, Mr. Nils Hilding'
 'Jefferys, Mr. Ernest Wilfred' 'Andersson, Mr. Johan Samuel'
 'Krekorian, Mr. Neshan' 'Nesson, Mr. Israel' 'Rowe, Mr. Alfred G'
 'Kreuchen, Miss. Emilie' 'Assam, Mr. Ali' 'Becker, Miss. Ruth Elizabeth'
 'Rosenshine, Mr. George (Mr George Thorne")"'
 'Clarke, Mr. Charles Valentine' 'Enander, Mr. Ingvar'
 'Davies, Mrs. John Morgan (Elizabeth Agnes Mary White) '
 'Dulles, Mr. William Crothers' 'Thomas, Mr. Tannous'
 'Nakid, Mrs. Said (Waika Mary" Mowad)"' 'Cor, Mr. Ivan'
 'Maguire, Mr. John Edward' 'de Brito, Mr. Jose Joaquim'
 'Elias, Mr. Joseph' 'Denbury, Mr. Herbert' 'Betros, Master. Seman'
 'Fillbrook, Mr. Joseph Charles' 'Lundstrom, Mr. Thure Edvin'
 'Sage, Mr. John George'
 'Cardeza, Mrs. James Warburton Martinez (Charlotte Wardle Drake)'
 'van Billiard, Master. James William' 'Abelseth, Miss. Karen Marie'
 'Botsford, Mr. William Hull'
 'Whabee, Mrs. George Joseph (Shawneene Abi-Saab)' 'Giles, Mr. Ralph'
 'Walcroft, Miss. Nellie' 'Greenfield, Mrs. Leo David (Blanche Strouse)'
 'Stokes, Mr. Philip Joseph' 'Dibden, Mr. William' 'Herman, Mr. Samuel'
 'Dean, Miss. Elizabeth Gladys Millvina""' 'Julian, Mr. Henry Forbes'
 'Brown, Mrs. John Murray (Caroline Lane Lamson)' 'Lockyer, Mr. Edward'
 "O'Keefe, Mr. Patrick"
 'Lindell, Mrs. Edvard Bengtsson (Elin Gerda Persson)'
 'Sage, Master. William Henry' 'Mallet, Mrs. Albert (Antoinette Magnin)'
 'Ware, Mrs. John James (Florence Louise Long)' 'Strilic, Mr. Ivan'
 'Harder, Mrs. George Achilles (Dorothy Annan)'
 'Sage, Mrs. John (Annie Bullen)' 'Caram, Mr. Joseph'
 'Riihivouri, Miss. Susanna Juhantytar Sanni""'
 'Gibson, Mrs. Leonard (Pauline C Boeson)' 'Pallas y Castello, Mr. Emilio'
 'Giles, Mr. Edgar' 'Wilson, Miss. Helen Alice' 'Ismay, Mr. Joseph Bruce'
 'Harbeck, Mr. William H' 'Dodge, Mrs. Washington (Ruth Vidaver)'
 'Bowen, Miss. Grace Scott' 'Kink, Miss. Maria'
 'Cotterill, Mr. Henry Harry""' 'Hipkins, Mr. William Edward'
 'Asplund, Master. Carl Edgar' "O'Connor, Mr. Patrick" 'Foley, Mr. Joseph'
 'Risien, Mrs. Samuel (Emma)' "McNamee, Mrs. Neal (Eileen O'Leary)"
 'Wheeler, Mr. Edwin Frederick""' 'Herman, Miss. Kate'
 'Aronsson, Mr. Ernst Axel Algot' 'Ashby, Mr. John' 'Canavan, Mr. Patrick'
 'Palsson, Master. Paul Folke' 'Payne, Mr. Vivian Ponsonby'
 'Lines, Mrs. Ernest H (Elizabeth Lindsey James)'
 'Abbott, Master. Eugene Joseph' 'Gilbert, Mr. William'
 'Kink-Heilmann, Mr. Anton'
 'Smith, Mrs. Lucien Philip (Mary Eloise Hughes)' 'Colbert, Mr. Patrick'
 'Frolicher-Stehli, Mrs. Maxmillian (Margaretha Emerentia Stehli)'
 'Larsson-Rondberg, Mr. Edvard A' 'Conlon, Mr. Thomas Henry'
 'Bonnell, Miss. Caroline' 'Gale, Mr. Harry'
 'Gibson, Miss. Dorothy Winifred' 'Carrau, Mr. Jose Pedro'
 'Frauenthal, Mr. Isaac Gerald'
 'Nourney, Mr. Alfred (Baron von Drachstedt")"'
 'Ware, Mr. William Jeffery' 'Widener, Mr. George Dunton'
 'Riordan, Miss. Johanna Hannah""' 'Peacock, Miss. Treasteall'
 'Naughton, Miss. Hannah'
 'Minahan, Mrs. William Edward (Lillian E Thorpe)'
 'Henriksson, Miss. Jenny Lovisa' 'Spector, Mr. Woolf'
 'Oliva y Ocana, Dona. Fermina' 'Saether, Mr. Simon Sivertsen'
 'Ware, Mr. Frederick' 'Peter, Master. Michael J']

 ------------------------------------------------------------ 

Name of Column: Sex
['male' 'female']

 ------------------------------------------------------------ 

Name of Column: Ticket
['330911' '363272' '240276' '315154' '3101298' '7538' '330972' '248738'
 '2657' 'A/4 48871' '349220' '694' '21228' '24065' 'W.E.P. 5734'
 'SC/PARIS 2167' '233734' '2692' 'STON/O2. 3101270' '2696' 'PC 17603'
 'C 17368' 'PC 17598' 'PC 17597' 'PC 17608' 'A/5. 3337' '113509' '2698'
 '113054' '2662' 'SC/AH 3085' 'C.A. 31029' 'C.A. 2315' 'W./C. 6607'
 '13236' '2682' '342712' '315087' '345768' '1601' '349256' '113778'
 'SOTON/O.Q. 3101263' '237249' '11753' 'STON/O 2. 3101291' 'PC 17594'
 '370374' '11813' 'C.A. 37671' '13695' 'SC/PARIS 2168' '29105' '19950'
 'SC/A.3 2861' '382652' '349230' '348122' '386525' '349232' '237216'
 '347090' '334914' 'F.C.C. 13534' '330963' '113796' '2543' '382653'
 '349211' '3101297' 'PC 17562' '113503' '359306' '11770' '248744' '368702'
 '2678' 'PC 17483' '19924' '349238' '240261' '2660' '330844' 'A/4 31416'
 '364856' '29103' '347072' '345498' 'F.C. 12750' '376563' '13905' '350033'
 '19877' 'STON/O 2. 3101268' '347471' 'A./5. 3338' '11778' '228414'
 '365235' '347070' '2625' 'C 4001' '330920' '383162' '3410' '248734'
 '237734' '330968' 'PC 17531' '329944' '2680' '2681' 'PP 9549' '13050'
 'SC/AH 29037' 'C.A. 33595' '367227' '392095' '368783' '371362' '350045'
 '367226' '211535' '342441' 'STON/OQ. 369943' '113780' '4133' '2621'
 '349226' '350409' '2656' '248659' 'SOTON/OQ 392083' 'CA 2144' '113781'
 '244358' '17475' '345763' '17463' 'SC/A4 23568' '113791' '250651' '11767'
 '349255' '3701' '350405' '347077' 'S.O./P.P. 752' '347469' '110489'
 'SOTON/O.Q. 3101315' '335432' '2650' '220844' '343271' '237393' '315153'
 'PC 17591' 'W./C. 6608' '17770' '7548' 'S.O./P.P. 251' '2670' '2673'
 '29750' 'C.A. 33112' '230136' 'PC 17756' '233478' '113773' '7935'
 'PC 17558' '239059' 'S.O./P.P. 2' 'A/4 48873' 'CA. 2343' '28221' '226875'
 '111163' 'A/5. 851' '235509' '28220' '347465' '16966' '347066'
 'C.A. 31030' '65305' '36568' '347080' 'PC 17757' '26360' 'C.A. 34050'
 'F.C. 12998' '9232' '28034' 'PC 17613' '349250' 'SOTON/O.Q. 3101308'
 'S.O.C. 14879' '347091' '113038' '330924' '36928' '32302' 'SC/PARIS 2148'
 '342684' 'W./C. 14266' '350053' 'PC 17606' '2661' '350054' '370368'
 'C.A. 6212' '242963' '220845' '113795' '3101266' '330971' 'PC 17599'
 '350416' '110813' '2679' '250650' 'PC 17761' '112377' '237789' '3470'
 '17464' '26707' 'C.A. 34651' 'SOTON/O2 3101284' '13508' '7266' '345775'
 'C.A. 42795' 'AQ/4 3130' '363611' '28404' '345501' '345572' '350410'
 'C.A. 34644' '349235' '112051' 'C.A. 49867' 'A. 2. 39186' '315095'
 '368573' '370371' '2676' '236853' 'SC 14888' '2926' 'CA 31352'
 'W./C. 14260' '315085' '364859' '370129' 'A/5 21175' 'SOTON/O.Q. 3101314'
 '2655' 'A/5 1478' 'PC 17607' '382650' '2652' '33638' '345771' '349202'
 'SC/Paris 2123' '113801' '347467' '347079' '237735' '315092' '383123'
 '112901' '392091' '12749' '350026' '315091' '2658' 'LP 1588' '368364'
 'PC 17760' 'AQ/3. 30631' 'PC 17569' '28004' '350408' '347075' '2654'
 '244368' '113790' '24160' 'SOTON/O.Q. 3101309' 'PC 17585' '2003' '236854'
 'PC 17580' '2684' '2653' '349229' '110469' '244360' '2675' '2622'
 'C.A. 15185' '350403' 'PC 17755' '348125' '237670' '2688' '248726'
 'F.C.C. 13528' 'PC 17759' 'F.C.C. 13540' '113044' '11769' '1222' '368402'
 '349910' 'S.C./PARIS 2079' '315083' '11765' '2689' '3101295' '112378'
 'SC/PARIS 2147' '28133' '112058' '248746' '315152' '29107' '680' '366713'
 '330910' '364498' '376566' 'SC/PARIS 2159' '349911' '244346' '364858'
 '349909' 'PC 17592' 'C.A. 2673' 'C.A. 30769' '371109' '13567' '347065'
 '21332' '28664' '113059' '17765' 'SC/PARIS 2166' '28666' '334915'
 '365237' '19928' '347086' 'A.5. 3236' 'PC 17758' 'SOTON/O.Q. 3101262'
 '359309' '2668']

 ------------------------------------------------------------ 

Name of Column: Cabin
['Unknown' 'B45' 'E31' 'B57 B59 B63 B66' 'B36' 'A21' 'C78' 'D34' 'D19'
 'A9' 'D15' 'C31' 'C23 C25 C27' 'F G63' 'B61' 'C53' 'D43' 'C130' 'C132'
 'C101' 'C55 C57' 'B71' 'C46' 'C116' 'F' 'A29' 'G6' 'C6' 'C28' 'C51' 'E46'
 'C54' 'C97' 'D22' 'B10' 'F4' 'E45' 'E52' 'D30' 'B58 B60' 'E34' 'C62 C64'
 'A11' 'B11' 'C80' 'F33' 'C85' 'D37' 'C86' 'D21' 'C89' 'F E46' 'A34' 'D'
 'B26' 'C22 C26' 'B69' 'C32' 'B78' 'F E57' 'F2' 'A18' 'C106' 'B51 B53 B55'
 'D10 D12' 'E60' 'E50' 'E39 E41' 'B52 B54 B56' 'C39' 'B24' 'D28' 'B41'
 'C7' 'D40' 'D38' 'C105']

 ------------------------------------------------------------ 

Name of Column: Embarked
['Q' 'S' 'C']

 ------------------------------------------------------------ 

📊 Insights:

  • The columns Age, Fare, and Cabin had null values in the dataset
  • The null values in the Age and Fare columns were filled with the median rather than the mean, because the median is less sensitive to outliers
  • The null values in the Cabin column were filled with the placeholder Unknown
  • Finally, we checked the unique values of the categorical columns to look for typos or other useful information
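
To illustrate why the median was preferred for imputation, here is a minimal sketch on a small made-up fare series (the values are hypothetical, not taken from the dataset). A single extreme value pulls the mean far from the typical fare, while the median barely moves.

```python
import numpy as np
import pandas as pd

# Hypothetical fares: mostly small values, one missing value, one extreme outlier
fares = pd.Series([7.0, 8.0, 9.0, 10.0, np.nan, 512.0])

mean_fill = fares.fillna(fares.mean())      # missing value replaced by the mean
median_fill = fares.fillna(fares.median())  # missing value replaced by the median

print("mean  :", fares.mean())    # pulled up by the outlier
print("median:", fares.median())  # stays close to the typical fare
```

Filling with the mean here would give the missing passenger a fare larger than four of the five observed fares, which is why the median is usually the safer default for skewed columns such as Fare.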

🔎 Feature Engineering 🔎

In [9]:
titanic.head()
Out[9]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 0 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 Unknown Q
1 893 1 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 Unknown S
2 894 0 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 Unknown Q
3 895 0 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 Unknown S
4 896 1 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 Unknown S
In [10]:
# Creating a new feature of title from name column based on the pattern found above

titanic['Title'] = titanic['Name'].str.extract(r',\s(.*?)\.')

titanic['Title'] = titanic['Title'].replace('Ms', 'Miss')
titanic['Title'] = titanic['Title'].replace('Dona', 'Mrs')
titanic['Title'] = titanic['Title'].replace(['Col', 'Rev', 'Dr'], 'Rare')
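
As a quick check of the extraction pattern, the sketch below applies the same regular expression to a handful of names copied from the output above. It assumes only pandas; the variable names `names` and `titles` are illustrative.

```python
import pandas as pd

# A few names taken from the Name column shown earlier
names = pd.Series([
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "O'Donoghue, Ms. Bridget",
    "Oliva y Ocana, Dona. Fermina",
    "Gracie, Col. Archibald IV",
])

# Capture the text between ", " and the first "." (lazy match)
titles = names.str.extract(r',\s(.*?)\.', expand=False)

# Merge rare spellings into the common groups, as in the cell above
titles = titles.replace({'Ms': 'Miss', 'Dona': 'Mrs'})
titles = titles.replace(['Col', 'Rev', 'Dr'], 'Rare')

print(titles.tolist())
```

The lazy quantifier `.*?` matters here: a greedy `.*` would run to the last period in the name rather than the one that ends the title.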
In [11]:
# Creating another feature of Age group by making bins

bins = [-np.inf, 17, 32, 45, 50, np.inf]
labels = ["Children", "Young", "Mid-Aged", "Senior-Adult", 'Elderly']
titanic['Age_Group'] = pd.cut(titanic['Age'], bins = bins, labels = labels)
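
The bin edges are easier to reason about with a small example. This sketch reuses the same `bins` and `labels` on a few hand-picked ages (the ages are illustrative) to show that `pd.cut` treats each interval as closed on the right by default, so an age of exactly 17 falls in Children and exactly 32 falls in Young.

```python
import numpy as np
import pandas as pd

bins = [-np.inf, 17, 32, 45, 50, np.inf]
labels = ["Children", "Young", "Mid-Aged", "Senior-Adult", "Elderly"]

# Ages chosen to sit on and around the bin edges
ages = pd.Series([0.17, 17, 17.5, 32, 45, 50, 76])

# Default right=True: intervals are (-inf, 17], (17, 32], (32, 45], (45, 50], (50, inf)
groups = pd.cut(ages, bins=bins, labels=labels)

print(pd.DataFrame({"Age": ages, "Age_Group": groups}))
```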
In [12]:
# Generating another new feature: family size

titanic['Family'] = titanic['SibSp'] + titanic['Parch']
In [13]:
# Dropping non-essential columns

titanic.drop(['PassengerId', 'Name', 'Ticket'], axis = 1, inplace = True)
In [14]:
titanic.head()
Out[14]:
Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked Title Age_Group Family
0 0 3 male 34.5 0 0 7.8292 Unknown Q Mr Mid-Aged 0
1 1 3 female 47.0 1 0 7.0000 Unknown S Mrs Senior-Adult 1
2 0 2 male 62.0 0 0 9.6875 Unknown Q Mr Elderly 0
3 0 3 male 27.0 0 0 8.6625 Unknown S Mr Young 0
4 1 3 female 22.0 1 1 12.2875 Unknown S Mrs Young 2
In [15]:
# Changing the position of the new columns to place them right after their parent columns

col_to_move = titanic.pop('Age_Group')
titanic.insert(4, 'Age_Group', col_to_move)

col_to_move = titanic.pop('Family')
titanic.insert(7, 'Family', col_to_move)

titanic['Age_Group'] = titanic['Age_Group'].astype('object')

📊 Insights:

  • Three new features were created: Title, Age_Group, and Family
  • The new columns were then moved next to their parent columns, and Age_Group was converted to the object data type

📊 Exploratory Data Analysis 📊


Descriptive Analysis


In [16]:
titanic.describe()
Out[16]:
Survived Pclass Age SibSp Parch Family Fare
count 418.000000 418.000000 418.000000 418.000000 418.000000 418.000000 418.000000
mean 0.363636 2.265550 29.599282 0.447368 0.392344 0.839713 35.576535
std 0.481622 0.841838 12.703770 0.896760 0.981429 1.519072 55.850103
min 0.000000 1.000000 0.170000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 1.000000 23.000000 0.000000 0.000000 0.000000 7.895800
50% 0.000000 3.000000 27.000000 0.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 35.750000 1.000000 0.000000 1.000000 31.471875
max 1.000000 3.000000 76.000000 8.000000 9.000000 10.000000 512.329200
In [17]:
titanic.describe(include = 'O')
Out[17]:
Sex Age_Group Cabin Embarked Title
count 418 418 418 418 418
unique 2 5 77 3 5
top male Young Unknown S Mr
freq 266 257 327 270 240
In [18]:
titanic.groupby('Sex')[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Family', 'Fare']].mean()
Out[18]:
Survived Pclass Age SibSp Parch Family Fare
Sex
female 1.0 2.144737 29.734145 0.565789 0.598684 1.164474 49.747699
male 0.0 2.334586 29.522218 0.379699 0.274436 0.654135 27.478728
In [19]:
titanic.groupby('Embarked')[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Family', 'Fare']].mean()
Out[19]:
Survived Pclass Age SibSp Parch Family Fare
Embarked
C 0.392157 1.794118 33.220588 0.421569 0.382353 0.803922 66.259765
Q 0.521739 2.869565 28.108696 0.195652 0.021739 0.217391 10.957700
S 0.325926 2.340741 28.485185 0.500000 0.459259 0.959259 28.179413

📊 Insights:

  • The most common categories are male (266 of 418 passengers), the Young age group (257), and embarkation at Southampton (270)
  • On average, females travel with more family members (1.16 vs 0.65) and pay higher fares (about 49.7 vs 27.5)
  • Passengers who embarked at Cherbourg have an average age of about 33 and paid an average fare of about 66 pounds

Univariate Analysis


In [20]:
survived_counts = titanic['Survived'].value_counts()
fig_surv_perc = px.pie(titanic, names= survived_counts.index,  values = survived_counts.values, title=f'Distribution of Survived', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_surv_perc.update_traces(textinfo='percent+label')
fig_surv_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_surv_perc.show()
In [21]:
pclass_counts = titanic.Pclass.value_counts()
fig_pclass_perc = px.pie(titanic, names= pclass_counts.index, values = pclass_counts.values, title=f'Distribution of Pclass', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_pclass_perc.update_traces(textinfo='percent+label')
fig_pclass_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_pclass_perc.show()
In [22]:
fig_sex_count = px.histogram(titanic, x = 'Sex', color = 'Sex', color_discrete_sequence=px.colors.sequential.Viridis)
fig_sex_count.update_layout(title_text='Count of different Sex', xaxis_title='Sex', yaxis_title='Count', plot_bgcolor = 'white')
fig_sex_count.show()

fig_sex_perc = px.pie(titanic, names= 'Sex', title=f'Distribution of Sex', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_sex_perc.update_traces(textinfo='percent+label')
fig_sex_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_sex_perc.show()
In [23]:
fig_age = px.histogram(titanic, x='Age', nbins=30, histnorm='probability density')
fig_age.update_traces(marker=dict(color='#440154'), selector=dict(type='histogram'))
fig_age.update_layout(title='Distribution of Age', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Age', yaxis_title='Probability Density', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), bargap=0.02, plot_bgcolor = 'white')
fig_age.show()
In [24]:
fig_fare = px.histogram(titanic, x='Fare', nbins=30, histnorm='probability density')
fig_fare.update_traces(marker=dict(color='#440154'), selector=dict(type='histogram'))
fig_fare.update_layout(title='Distribution of Fare', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Fare', yaxis_title='Probability Density', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), bargap=0.02, plot_bgcolor = 'white')
fig_fare.show()
In [25]:
fig_embarked_count = px.histogram(titanic, x = 'Embarked', color = 'Embarked', color_discrete_sequence=px.colors.sequential.Viridis)
fig_embarked_count.update_layout(title_text='Count of different Embarked', xaxis_title='Embarked', yaxis_title='Count', plot_bgcolor = 'white')
fig_embarked_count.show()

fig_embarked_perc = px.pie(titanic, names= 'Embarked', title=f'Distribution of Embarked', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_embarked_perc.update_traces(textinfo='percent+label')
fig_embarked_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_embarked_perc.show()
In [26]:
fig_title_count = px.histogram(titanic, x = 'Title', color = 'Title', color_discrete_sequence=px.colors.sequential.Viridis)
fig_title_count.update_layout(title_text='Count of different Title', xaxis_title='Title', yaxis_title='Count', plot_bgcolor = 'white')
fig_title_count.show()

fig_title_perc = px.pie(titanic, names= 'Title', title=f'Distribution of Title', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_title_perc.update_traces(textinfo='percent+label')
fig_title_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_title_perc.show()

📊 Insights:

  • Only 36.4% of the passengers survived the sinking

  • The dataset also has a high proportion of people from Pclass = 3, and a high ratio of males

  • The age distribution is centered around 25-29, and most fares fall around 10-30

  • Most passengers embarked at Southampton, and the most common title is Mr (adult male)


Bivariate Analysis


In [27]:
fig_pclass_surv = px.histogram(titanic, x = 'Pclass', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_pclass_surv.update_layout(title = 'Survival according to passenger classes', plot_bgcolor = 'white')
fig_pclass_surv.show()
In [28]:
fig_sex_surv = px.histogram(titanic, x = 'Sex', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_sex_surv.update_layout(title = 'Survival according to gender', plot_bgcolor = 'white')
fig_sex_surv.show()
In [29]:
fig_age_group_surv = px.histogram(titanic, x = 'Age_Group', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_age_group_surv.update_layout(title = 'Survival according to age groups', plot_bgcolor = 'white')
fig_age_group_surv.show()
In [30]:
fig_family_surv = px.histogram(titanic, x = 'Family', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_family_surv.update_layout(title = 'Survival according to number of family members', plot_bgcolor = 'white')
fig_family_surv.show()
In [31]:
fig_embarked_surv = px.histogram(titanic, x = 'Embarked', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_embarked_surv.update_layout(title = 'Survival according to embarked', plot_bgcolor = 'white')
fig_embarked_surv.show()

📊 Insights:

  • The fewest deaths are in Pclass = 1 and the most deaths are in Pclass = 3

  • No male passenger survived and every female passenger survived, so in this dataset Survived is fully determined by Sex

  • The Young age group has the highest death count, and the Elderly group has a good survival count

  • People with few family members are more likely to survive according to the analysis

  • A high proportion of the people who embarked at Queenstown survived (about 52%), and Southampton has the highest number of deaths
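
Several of the points above compare raw counts, which depend on how large each group is. A minimal sketch of how counts and rates can be put side by side is shown below; the toy frame is hypothetical and does not reproduce the real passenger numbers.

```python
import pandas as pd

# Hypothetical toy data, not the real passenger counts
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 3, 3, 3],
    "Survived": [0, 1, 0, 0, 0, 1, 1, 1],
})

# Count, survivors, and survival rate per class in one pass
summary = toy.groupby("Pclass")["Survived"].agg(
    passengers="count",
    survivors="sum",
    survival_rate="mean",
)
summary["deaths"] = summary["passengers"] - summary["survivors"]

print(summary)
```

In this toy example the third class has three times as many deaths as the first class, yet both classes have the same survival rate. Running the same aggregation on `titanic` would show whether the differences in the charts come from group size or from a real difference in survival rate.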


Multivariate Analysis


In [32]:
grouped_data = titanic.groupby(['Age', 'Sex', 'Survived']).agg({'Fare': 'mean'}).reset_index()
fig = px.line(grouped_data, x='Age', y='Fare', color='Survived', facet_col='Sex', facet_col_wrap=2, labels={'Fare': 'Fare', 'Survived': 'Survived'}, title='Relation of age and gender with fare')

fig.update_layout(hovermode='x unified', plot_bgcolor = 'white')
fig.update_xaxes(title_text='Age')
fig.update_yaxes(title_text='Fare', row=1, col=1)
fig.show()

📊 Insights:

  • The analysis shows that fares are somewhat higher for females than for males, and for females the fare tends to increase with age. Overall, Age does not appear to have a significant impact on survival

⚙️ Data Preprocessing ⚙️


1. Label Encoding


In [33]:
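# Encoding the categorical columns as integer codes
# (LabelEncoder assigns codes in sorted order, so the codes do not express a meaningful ranking)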

le = LabelEncoder()
cols = ['Sex', 'Age_Group', 'Cabin', 'Embarked', 'Title']

for col in cols:
    titanic[col] = le.fit_transform(titanic[col])
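
Because the loop above refits a single encoder on every column, only the mapping for the last column remains available afterwards. One possible variation, sketched here on a small hypothetical frame, keeps one fitted encoder per column so the integer codes can be translated back to the original categories later. The `encoders` dictionary is an illustrative name, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["Q", "S", "C"],
})

# Keep a separate fitted encoder for each column
encoders = {}
for col in ["Sex", "Embarked"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)

# The stored encoder can map codes back to the original labels
print(encoders["Embarked"].inverse_transform([0, 1, 2]))
```

Note that the codes follow the sorted order of the labels (C, Q, S become 0, 1, 2), which linear models will read as an ordering. For nominal columns such as Embarked, one-hot encoding is a common alternative.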

2. Class Imbalance


In [34]:
# Checking the class count for target variable

titanic.Survived.value_counts()
Out[34]:
Survived
0    266
1    152
Name: count, dtype: int64
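
As a small worked check, the counts above can be turned into class shares. The snippet below re-enters the two counts from the output by hand (the `counts` Series is built for illustration only); on the real data, `titanic.Survived.value_counts(normalize=True)` would give the same shares directly.

```python
import pandas as pd

# Counts copied from the Out[34] cell above.
counts = pd.Series({0: 266, 1: 152}, name="count")

# Share of each class in the target variable.
share = counts / counts.sum()

print(share.round(3))
```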
In [35]:
X = titanic.drop('Survived', axis = 1)
y = titanic['Survived']
In [36]:
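# Using the SMOTE technique to handle class imbalance
# Note: resampling before the train/test split means synthetic samples can be built from rows that end up in the test set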

smote = SMOTE(random_state = 42)
X_balanced, y_balanced = smote.fit_resample(X, y)

3. Splitting into training and testing


In [37]:
# Splitting the dataset into training and testing parts

X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size = 0.3, random_state = 42) 
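
One common way to keep the test set free of resampled data is to split first and rebalance only the training part. The sketch below illustrates that ordering on hypothetical toy data (`X_toy`, `y_toy` and all derived names are made up for the example). It uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE so that it runs with scikit-learn alone; in the notebook, `smote.fit_resample(X_train, y_train)` would take the place of step 2.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced toy data: 70 negatives, 30 positives.
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({"f1": rng.normal(size=100), "f2": rng.normal(size=100)})
y_toy = pd.Series([0] * 70 + [1] * 30, name="Survived")

# 1) Split first; stratify keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# 2) Oversample the minority class in the training part only.
train = pd.concat([X_tr, y_tr], axis=1)
majority = train[train["Survived"] == 0]
minority = train[train["Survived"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_bal = pd.concat([majority, minority_up])

X_tr_bal = train_bal.drop(columns="Survived")
y_tr_bal = train_bal["Survived"]

# The training part is balanced; the test part keeps the original class ratio.
print(y_tr_bal.value_counts())
print(y_te.value_counts())
```

With this ordering the test set contains only original rows, so the reported scores describe performance on data the resampler never touched.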

4. Feature Scaling


In [38]:
# Scaling the features with StandardScaler (fitted on the training data only)

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

🎯 Model Building 🎯

In [39]:
# Building the models

lr = LogisticRegression()
rf = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train_scaled, y_train)
rf.fit(X_train_scaled, y_train)
gbc.fit(X_train_scaled, y_train)

lr_pred = lr.predict(X_test_scaled)
rf_pred = rf.predict(X_test_scaled)
gbc_pred = gbc.predict(X_test_scaled)

⚡ Model Evaluation ⚡

In [40]:
# Evaluating the models by generating classification report and cross validation scores

lr_report = classification_report(y_test, lr_pred)
lr_scores = cross_val_score(lr, X_train_scaled, y_train, cv=5, scoring='accuracy')

rf_report = classification_report(y_test, rf_pred)
rf_scores = cross_val_score(rf, X_train_scaled, y_train, cv=5, scoring='accuracy')

gbc_report = classification_report(y_test, gbc_pred)
gbc_scores = cross_val_score(gbc, X_train_scaled, y_train, cv=5, scoring='accuracy')


print('The classification report of Logistic Regression is below : ', '\n\n\n', lr_report)
print(f"Logistic Regression Mean Cross-Validation Score: {lr_scores}")

print('\n', '='*100, '\n')
print('The classification report of Random Forest is below : ', '\n\n\n', rf_report)
print(f"Random Forest Mean Cross-Validation Score: {rf_scores}")

print('\n', '='*100, '\n')
print('The classification report of Gradient Boosting Classifier is below : ', '\n\n\n', gbc_report)
print(f"Gradient Boosting Classifier Mean Cross-Validation Score: {gbc_scores}")
The classification report of Logistic Regression is below :  


               precision    recall  f1-score   support

           0       1.00      1.00      1.00        78
           1       1.00      1.00      1.00        82

    accuracy                           1.00       160
   macro avg       1.00      1.00      1.00       160
weighted avg       1.00      1.00      1.00       160

Logistic Regression Mean Cross-Validation Score: [1. 1. 1. 1. 1.]

 ==================================================================================================== 

The classification report of Random Forest is below :  


               precision    recall  f1-score   support

           0       1.00      1.00      1.00        78
           1       1.00      1.00      1.00        82

    accuracy                           1.00       160
   macro avg       1.00      1.00      1.00       160
weighted avg       1.00      1.00      1.00       160

Random Forest Mean Cross-Validation Score: [1. 1. 1. 1. 1.]

 ==================================================================================================== 

The classification report of Gradient Boosting Classifier is below :  


               precision    recall  f1-score   support

           0       1.00      1.00      1.00        78
           1       1.00      1.00      1.00        82

    accuracy                           1.00       160
   macro avg       1.00      1.00      1.00       160
weighted avg       1.00      1.00      1.00       160

Gradient Boosting Classifier Mean Cross-Validation Score: [1. 1. 1. 1. 1.]
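
The cross-validation calls above receive data that was already scaled with statistics from the whole training set. A possible alternative, sketched here on synthetic data from `make_classification` (the data and the names `X_toy`, `y_toy`, `model` are assumptions for illustration), wraps the scaler and the classifier in a pipeline so the scaler is refitted inside each fold, and reports the mean and spread of the fold scores rather than the raw array.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)

# The scaler is fitted on the training folds only, then applied to the held-out fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")

print(f"fold accuracies: {scores.round(3)}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The same pattern should apply to `RandomForestClassifier` and `GradientBoostingClassifier` by swapping the final pipeline step.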

🎈 Conclusion 🎈

Conclusion:


In this Titanic Survival Prediction analysis, we have explored various aspects of the dataset to understand the factors influencing survival. We found that only 36.4% of the passengers (152 of 418) survived the sinking, with significant differences in survival rates among different passenger classes, genders, and age groups. The dataset also revealed that certain features, such as Fare and embarkation location, played a role in survival. We trained several classification models to predict survival, all of which scored 100% on the test split and in cross-validation. Since no males and all females survived in this dataset, Survived is fully determined by Sex, so these perfect scores most likely reflect that relationship rather than genuine predictive power.


Insights:

Our analysis unveiled key insights into the Titanic dataset. We addressed missing values by filling null entries in the Age and Fare columns with medians due to the presence of outliers, while the Cabin column was filled with "Unknown." New features, including Title, Age_Group, and Family, were created to enhance our understanding of passenger demographics. We discovered that young males traveling from Southampton constituted the majority, and females were more likely to travel with others and pay higher fares. Notably, passengers from Cherbourg had an average age of 33 and paid around 66 pounds in fares. Furthermore, we observed that Pclass 3 had the highest number of deaths, and that across the dataset no males survived while all females did. Family size appeared to influence survival, and passengers from Queenstown had a higher survival rate compared to those from Southampton.


What's next?

For future analysis, it would be beneficial to explore more advanced machine learning techniques and consider feature engineering to improve model performance further. Additionally, investigating the impact of other variables not included in this analysis, such as cabin location and passenger demographics beyond age, gender, and family size, could provide deeper insights. Further exploration of the dataset and refining models could enhance our ability to predict Titanic passenger survival more accurately.

